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Abstract 

Background: For the development of genome assembly tools, some comprehensive and efficiently computable 
validation measures are required to assess the quality of the assembly. The mostly used N50 measure summarizes the 
assembly results by the length of the scaffold (or contig) overlapping the midpoint of the length-order concatenation 
of scaffolds (contigs). Especially for scaffold assemblies it is non-trivial to combine a correctness measure to the N50 
values, and the current methods for doing this are rather involved. 

Results: We propose a simple but rigorous normalized N50 assembly metric that combines N50 with such a 
correctness measure; assembly is split into as many parts as necessary to align each part to the reference. For 
scalability, we first compute maximal local approximate matches between scaffolds and reference in distributed 
manner, and then proceed with co-linear chaining to find a global alignment. Best alignment is removed from the 
scaffold and the process is iterated with the remaining scaffold content in order to split the scaffold into correctly 
aligning parts. The proposed normalized N50 metric is then the N50 value computed for the final correctly aligning 
parts. As a side result of independent interest, we show how to modify co-linear chaining to restrict gaps to produce a 
more sensible global alignment. 

Conclusions: We propose and implement a comprehensive and efficient approach to compute a metric that 
summarizes scaffold assembly correctness and length. Our implementation can be downloaded from http://www.es. 
helsinki.fi/group/scaffold/normalizedN50/. 



Background 

In de novo genome assembly (see e.g. [1]), the result is usu- 
ally a set of strings called scaffolds. These are DNA strings 
containing runs of Ns denoting gap regions of estimated 
length. The parts separated by N-runs are contigs. A typi- 
cal measure to assess how well the assembly has succeeded 
is N50 measure, which equals the length of the scaffold 
(or contig) overlapping the midpoint of the length-order 
concatenation of scaffolds (contigs). If there is a reference 
genome available (as in assembly challenges, see http:// 
assemblathon.org [2] and http://gage.cbcb.umd.edu/ [3]), 
one can align the scaffolds to reason about the accuracy. 
With contig assemblies one can report a normalized N50 
value that takes into account only the parts of the assembly 
that can be aligned to the reference using standard local 
alignment tools. With scaffold assemblies, normalization 
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is more complex, as local alignment results for the contigs 
need also to be chained to form global alignments for the 
scaffolds. This chaining is typically done using heuristic 
approaches. Only very recently more attention has been 
paid for this problem: In [2] they build multiple align- 
ment for the scaffolds and reference, and represent it as 
an adjacency graph where there are edges for representing 
aligned contigs and for adjacencies proposed by scaffolds 
(and some other types, see [2] for details). Then one can 
look at maximal paths that alternate between these two 
types to form scaffold paths. All such maximal scaffold 
paths can be extracted and used in the computation of 
normalized N50 value (called scaffold path NGSO in [2]). 
While this is a diligent approach given an adjacency graph, 
the overall approach is highly dependent on heuristics to 
compute the multiple alignment, which is a very challeng- 
ing computational problem to be solved exactly. This is 
especially true now that the multiple alignment tool used 
needs to cope with rearrangements in order to be able 
to align the partly incorrect scaffolds correctly. Although 
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there exists effective tools (see e.g. [4]) even for this hard- 
est instance of multiple alignment, the question is if there 
is an alternative approach that can completely avoid hard 
computational subproblems. 

We propose a much simpler but still rigorous approach 
to compute normalized N50 scaffold assembly metric that 
combines N50 with correctness measure; in principle, 
assembly is split into as many parts as necessary to align 
each part to the reference. For example, let reference be 
GTAAGGCGAGGCTGAGAGT and let the assembly consist 
of two scaffolds CTGNNNGT and AGAGTANNNNGAGG, with 
N50 being 14. If we split the assembly into CTGNNNGT, 
AGA, and GTANNNNGAGG, then each piece aligns perfectly 
and the normalized N50 is 8. We show that this process 
can be modeled by three well-defined subproblems, each 
of which has an efficient and exact algorithmic solution. 

In more detail, one needs to allow mismatches and 
indels in the alignment so that only the real structural 
errors in the assembly are measured. Moreover, the gaps 
between contigs in a scaffold may not be accurate due to 
variation in insert sizes of the mate pair reads used for 
the scaffold assembly. Taking these aspects into account, 
it would be easy to construct a dynamic programming 
recurrence to find the best scoring alignment for a scaf- 
fold, allowing gaps (N-runs) to vary within given thresh- 
olds. However, the running time would be infeasible for 
any practical use; one iteration would take at least 0(mn) 
time, where m is the total length of scaffolds and n the 
length of the reference. 

We propose a practical scheme of computing an approx- 
imation of the normalized N50 metric using the com- 
mon seed-based strategy: First compute all maximal local 
approximate matches between scaffolds and reference, 
then chain those local alignments that retain the order 
both in reference and in each scaffold. This approach is 
called co-linear chaining [5]. As it was originally devel- 
oped for RNA-DNA alignment, there was no need for 
restricting gaps in the chains, as introns can be almost 
arbitrary long. For our purposes, DNA-DNA alignment, 
it makes sense to restrict the length of the gaps between 
consecutive local alignments, as gaps should not be much 
longer than the insert size of mate pair reads. Finally, 
this alignment is repeated extracting the largest correctly 
aligning part from each scaffold in each step. We note that 
our approach is rigorous in the sense that we can avoid 
heuristics in each of the three subproblems considered 
(see the discussion in the end). 

In what follows, we assume that local alignments are 
given, and first concentrate on modifying co-linear chain- 
ing for the case of restricted gaps. Then we proceed in 
explaining our implementation of the normalized N50 
computation incorporating the local alignment compu- 
tation with gap-restricted co-linear chaining. We then 
give our results on an experiment demonstrating how 



normalized N50 can characterize good and bad scaffold 
assemblies. Discussion follows on other possible uses and 
variations of the method proposed. 

Methods 

Let us assume that all local alignments between scaffold 
and reference genome have been computed, and we have 
a set of tuples V = {(x,y,c,d)} such that matches 
P[ c, d], where T[ 1, n] is the genome and P[ 1, m] the scaf- 
fold. In co-linear chaining the goal is to find a sequence of 
tuples S = s\S2 - - - s p e V p such that Sj.y > Sj-i.y, sj.d > 
Sj-\.d, for all 1 <j <p, and \ {i | i e[sj.c,Sj.d] for somel < 
j < p}\ is maximized. That is, find tuples preserving order 
in both T and P such that the resulting ordered cover- 
age of P is maximized. We review an efficient solution 
given in [5] and extend it for our purposes. First, sort 
tuples in V by the coordinate y into sequence v\v<i • • • vm. 
Then, fill a table C[ 1 . . . N] such that C[j] gives the maxi- 
mum ordered coverage of P[ 1, vj.d] over the choice of any 
subset of tuples from {vi, V2, . . . vj}. Hence max/C[/] gives 
the total maximum ordered coverage of P. Then one can 
derive the following formulae for C[j] [5] depending on 
the case: (a) Either the previous tuple Vf does not overlap 
Vj in P; or (b) the previous tuple Vf overlaps Vj in P. For (a) 
it holds 

C a [;]= max C[/]+(v ; -^-v 7 -.c+l). (1) 

j':Vji .d<Vj.c 

For (b) it holds 
C b [;]= max C[f] +{Vj.d - v f .d). (2) 

j':Vj.c<Vj/ .d<Vj.d 

Then the final value is C[j] = max(C a [;] , C b [;] ). Now 
we can modify the formulae taking the invariant values 
out from the maximizations to obtain range maximum 
queries. These can be solved using a search tree T that 
supports in 0(\ogn) time operations Insertiy, i) to add 
value v to the tree with key i (if key i is already in the tree, 
replace its value v' with max(v, v')); Delete(i) for deleting 
node with key /; and v = Maxil, r) to return the maxi- 
mum value v from nodes {/} that belong to the interval 
I < i < r. Since there are two different maximizations, 
we need to maintain two different search trees. Notice 
that applying the recurrence directly would yield a trivial 
0(N 2 ) time algorithm, whereas the use of invariant and 
search tree gives 0(N log N) time. The resulting pseu- 
docode (analogous to one in [5]) is given below. 

Algorithm CoLi near Chaining ( V sorted by 
y- coordinate : v\, V2, • • • , vjq) 

(1) T Jnsert(O f 0); 1 Jnsert(0, 0); 

(2) for ; ' ^ 1 to N do 

(3) C a [;] <- (Vj.d - VjX + 1) + TMax(0 9 Vj.c - 1); 
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(4) C b [ ;] <r- Vj.d + IMaxtyx, Vj.d); 

(5) C[;]^max(C a [;],C b [/]); 

(6) T. Insert (C[j], Vj.d); 

(7) 1. Insert (C[j] —Vj.d, vj.d); 

(8) return max/ C[/]; 

The alignment given by applying the above algorithm 
allows arbitrary long gaps, which is not a desirable feature. 
The gaps between consecutive contigs in scaffolds are 
restricted by the mate pair insert size, which also tells that 
in a correct alignment to the genome the gaps should not 
deviate much from this value. It is easy to modify co-linear 
chaining to restrict gaps: Replacing TMaxiQ, Vj.c— 1) with 
TMaxiyj.c — maxgap, Vj.c — 1) at line (3) in the pseu- 
docode restricts the gap in the scaffold by maxgap. To 
obtain analogous effect simultaneously in the reference 
genome, is a bit more tricky. Let us first describe a method 
that works in the special case that Vj.y — Vj.x are equal for 
all / and then consider the modifications required to han- 
dle the general case. For the special case, one can deploy 
T.DeleteQ as follows: At step j of the algorithm, main- 
tain the invariant that T only contains all tuples vy having 
Vj/.y > Vj.x — maxgap and / < This is accomplished by 
adding the following code between lines (2) and (3) and 
initializing/ = 1: 

(3') while Vj.x — maxgap < Vf.y do 

(3") T.Delete(vy.d);f <- / + 1 

(For simplicity of exposition, this assumes values vy.d 
are unique keys. One can use e.g. tuples (vy.d,/) as keys 
to ensure uniqueness.) The correctness for the special 
case follows as Vj-\.x < Vj.x for all j > 2, and one 
can thus delete values incrementally so that the invari- 
ant is satisfied. The method fails in the general case since 
we can have Vj-\.x > Vj.x and tuples with j-coordinate 
between [ Vj.x— maxgap, Vj-\.x— maxgap] are deleted. To 
overcome this, one can modify the algorithm as follows. 
Duplicate tuples and use ^-coordinate for one copy and y- 
coordinate for the other as the sorting key. Now each tuple 
has left and right occurrence in sorted V. Apply the above 
algorithm, but do deletions only on left occurrences. In 
addition, on left occurrences, compute C[j] with lines 3- 
5 in the algorithm above, add the pair (v/, C[j] ) in a list 
of active tuples V instead of applying lines 6-7 above. On 
right occurrences, update C[j] again but before lines 6- 
7 above, take the maximum of that value and the one 



stored for active tuple Vj in V. Then remove Vj from V 
and recompute C[f] for all active tuples vy in V choosing 
as C[f] the maximum of its previous value and the value 
computed applying lines 3-5 in the algorithm above. The 
correctness now follows from the facts that (a) when Vj is 
added to the active tuples V, C[j] is the maximum value 
without overlapping tuples vy taken into account, and (b) 
all the overlapping tuples vy with Vj.x < vy.y < Vj.y have 
their right occurrence before that of Vj and hence trigger 
the update of active tuple (v/, C[j] ). 

Results 

We used swift [6] for producing local alignments: The 
program takes as parameters the minimum match length 
(mini en) and the maximum error level (maxerror) as 
a percentage determining that at most maxerror xL edit 
errors can be in a match of length L >minlen. It then 
finds all maximal local alignments satisfying the parame- 
ter constraints. The process was distributed so that scaf- 
folds were partitioned into equal size chunks and each 
chunk allocated to a different cluster node. 

The rest of the process (co-linear chaining, extraction 
of alignments, computation of N50) was executed on a 
single machine. To compute the normalized N50 value, 
the process was hence to apply co-linear chaining itera- 
tively, always extracting the best alignment and splitting 
the scaffold accordingly. The process was repeated until 
all pieces (that had a local alignment in the first place) 
found their matches. The N50 of the pieces obtained this 
way is then called the normalized N50. Reverse comple- 
ments were taken account appropriately; scaffolds were 
aligned to both strands and only contig alignments with 
the same orientation were combined to form a scaffold 
alignment. 

We have already used normalized N50 in [7] to compare 
different scaffolders. We report here an experiment that 
gives some more insight to the normalized N50 measure: 
We created a varying number of random intra-scaffold 
contig swaps to an assembly MlP-elegans in [7] and com- 
puted normalized N50 for each variant. This gives a 
sampling between good assembly and completely random 
assembly such that scaffold N50 stays the same in all ver- 
sions, but accuracy of the assembly should drop. One 
can see from Table 1 that normalized N50 indeed reflects 
this expected behaviour. The percentages in Table 1 give 
the amount of contigs translocated. Coverage values are 



Table 1 Normalized N50 on original assembly versus varying randomized assemblies with the same N50 

Original 10% 20% 30% 50% 100% 

Normalized N50 183891 92212 56964 43461 33533 30403 

Genome coverage 0.9333 0.6410 0.4778 0.4153 0.3421 0.3311 

Scaffold coverage 0.9859 0.6847 0.5071 0.4414 0.3642 0.3522 
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Figure 1 An example where greedy extraction of gap-restricted 
co-linear chains may result into more pieces than optimal. 

Greedy selection would align blocks 2, 3, 4 with dashed edges, but 
then with suitable gap-restriction blocks 1 and 5 could not be aligned 
together, and the assembly would be split into 3 parts. Optimal 
algorithm can choose 2 and 4 with dashed edges and then blocks 1 , 
3, 5 together, resulting into 2 parts only. It is possible to construct 
such an example even without multiple mappings for the blocks. 



computed after the first iteration of co-linear chaining. 
The reference genome is Caenorhabditis Elegans of length 
100.3 Mbp. The assembly was produced by the MIP 
Scaf folder of [7] and has N50 value 189704. 

For the experiment we ran the validate_distributed.sh 
script of our tool with parameters maxerror 0.02, 
minlen 35, maxgap 5000 and numjobs 120. Here the 
two first parameters are the ones for swi f t explained ear- 
lier. Third is used for restricting gaps in co-linear chaining, 
and the last is for distributing the heaviest part of the 
computation (local alignments). The 120 swi ft jobs were 
distributed on 20 machines taking overall 115 minutes for 
one run. The rest of the computation took 3 minutes on a 
single machine. 

Discussion 

The proposed method should also work for validating an 
RNA assembly against a DNA reference, by just setting 
the maximum gap length to the maximum possible intron 
length. Also one could use it for whole genome com- 
parison between two species, by considering how many 
pieces one genome needs to be partitioned in order to 
align to the other. Such measure is not very accurate as 
it does not model a sequence of evolutionary events to 
explain the transformation, like the genome rearrange- 
ment distances, but the approach gives the number of 
breakpoints which can be used as a lower-bound. How- 
ever, much more elaborate tools for that purpose have 
been developed [8]. 

We stress that our approach has also some con- 
ceptual value in avoiding unnecessary heuristics. The 
three main steps (i) finding maximal local alignments, 
(ii) co-linear chaining, and (iii) splitting the scaffolds, 
have each an algorithmically correct solution. For (i) 
and (ii) one can refer to [5,6], as well as for the gap- 
restricted case covered in this paper. For (iii) one can 
refer to the folk theorem that greedy splitting of a 
string into maximal aligning pieces is optimal strat- 
egy if one wants to minimize the number of aligning 
pieces; this extends to the case of extracting aligning 
pieces from scaffolds greedily. It is interesting that actu- 
ally with the gap-restricted co-linear chaining, this folk 
theorem does not hold anymore, see Figure 1. This 
leaves the open question whether there is an efficiently 
computable optimal strategy for splitting in this special 
case. 

Finally, the approach in [2] is especially designed for 
evaluations where the reference consists of two haplo- 
types. Our approach is straightforward to modify for this 
case: The two haplotypes can be concatenated and used 
as the reference sequence to our program. This way the 
scaffolds will be split to parts that match one of the 
haplotypes only and the evaluation does not favor 
assemblies whose contigs or scaffolds alternate between 



haplotypes. On the other hand, obtaining assemblies that 
would separate the two haplotypes is quite unlikely with 
just short read sequencing data. It is also as easy to modify 
our approach for the case where haplotypes are allowed to 
mix: Assuming that the pair- wise alignment of haplotypes 
is known (which is the case with artificial data generated 
for evaluations), one can do the first step of our approach 
(maximal local alignments) separately for each haplotype, 
then project all the local alignment results to one haplo- 
type using the known pair-wise alignment. After this the 
chaining allows haplotypes to mix. 

Conclusions 

We proposed and implemented a comprehensive and effi- 
cient approach to compute a metric that summarizes 
scaffold assembly correctness and length. Our implemen- 
tation can be downloaded from http://www.cs.helsinki.fi/ 
group/scaffold/normalizedN50/. 
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